Skip to content

Conversation

Kmeakin
Copy link
Contributor

@Kmeakin Kmeakin commented Oct 13, 2025

Minor refactors to unicode_data that occured to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC

@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2025

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 13, 2025
@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2025

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from 2c5244e to 90adbe2 Compare October 13, 2025 01:07
Instead of `include_str!()`ing `range_search.rs`, just make it a normal
module under `core::unicode`. This means the same source code doesn't
have to be checked in twice, and it plays nicer with IDEs.

Also rename it to `rt` since it includes functions for searching the
bitsets as well as the range represesentation.
Run `rustfmt` on the generated tables. This means we won't have to worry
so much about getting indetation and formatting right when generating
code.

Exempted for now some tables which would be too big when formatted by
`rustfmt`.
This check was made redundant (it will always be true) when we removed
all ASCII characters from the tables
(rust-lang@a8c6694).
To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to
  `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format
  string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()`
  call in the function, just construct the string directly with
  `format!()`.
@Kmeakin Kmeakin force-pushed the km/unicode-data/refactors branch from 90adbe2 to 1a646cf Compare October 13, 2025 20:31
@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.


table_file.push_str(&version());

table_file.push_str("use super::rt::*;\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nitpicking here, but could you move this import above the definition of the version const?

writeln!(s, "// ignore-tidy-filelength")?;
writeln!(s, "use std::intrinsics;")?;
writeln!(s, "mod unicode_data;")?;
writeln!(s, "mod rt {{ {} }}", include_str!("../../../../library/core/src/unicode/rt.rs"))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could lead to problems if the rt module is moved or starts importing other things from core. Do you know why the test file isn't just generated as a submodule of unicode_data.rs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

library/core can't have any tests. All libcore tests must be in library/coretests instead. Trying to compile libcore in test mode will fail with duplicate lang items as libtest pulls in another copy of libcore.

}

fn rustfmt(path: &str) {
std::process::Command::new("rustfmt").arg(path).status().expect("rustfmt failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is rustfmt really always in PATH when this command is run? Otherwise, I think it'd be easier to slap a big #[rustfmt::skip] on the mod unicode_data.

}

static HEADER: &str = r"
const INDEX_MASK: u32 = 1 << 22;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is INDEX_MASK guaranteed to stay constant? Otherwise, I wouldn't duplicate the definition here, this is likely to go out of sync.

mod skiplist;
mod unicode_download;

pub use fmt_helpers::*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this pub?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants